A User-triggered Checkpointing Library for Computation-intensive Applications
Abstract
We propose a method for incorporating coordinated checkpointing and rollback into high-performance computing applications on massively parallel computers. A library lets the user specify which data items (including files) belong to the checkpoint and trigger checkpointing from within the application. Recovery-line management on the distributed disk system tracks which recovery lines are valid (and may be used for rollback) and which are obsolete (and may be deleted). This approach provides a framework for incorporating non-blocking, continue-before-validate checkpointing into the application. Its main advantages are hardware independence and flexibility.
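To make the idea concrete, here is a minimal single-process sketch of user-triggered checkpointing with recovery-line bookkeeping. It is not the library's API: the names `RecoveryLineStore`, `register`, `checkpoint`, `valid_lines`, and `rollback` are hypothetical, the on-disk layout (one directory per recovery line plus a `VALID` marker file) is an assumption, and coordination between parallel processes is left out entirely. The sketch also blocks while it writes, so it does not show the non-blocking, continue-before-validate mode mentioned above.

```python
import json
import os
import shutil
import tempfile


class RecoveryLineStore:
    """Toy stand-in for recovery-line management on disk (hypothetical API)."""

    def __init__(self, root):
        self.root = root
        self.items = {}  # name -> (save, restore) callables supplied by the user
        os.makedirs(root, exist_ok=True)
        existing = [int(e[len("line_"):]) for e in os.listdir(root)
                    if e.startswith("line_")]
        self.next_id = max(existing, default=-1) + 1

    def register(self, name, save, restore):
        """Declare that a data item belongs to the checkpoint contents."""
        self.items[name] = (save, restore)

    def _line_dir(self, line_id):
        return os.path.join(self.root, f"line_{line_id:06d}")

    def checkpoint(self):
        """User-triggered: write every registered item, then validate the line."""
        line_id = self.next_id
        self.next_id += 1
        path = self._line_dir(line_id)
        os.makedirs(path)
        for name, (save, _) in self.items.items():
            with open(os.path.join(path, name + ".json"), "w") as f:
                json.dump(save(), f)
        # The marker is written last; a line without it is never used for rollback.
        with open(os.path.join(path, "VALID"), "w") as f:
            f.write("ok")
        # Older valid lines are obsolete once a newer line is valid.
        for old in self.valid_lines():
            if old < line_id:
                shutil.rmtree(self._line_dir(old))
        return line_id

    def valid_lines(self):
        """Return the ids of all recovery lines that carry the VALID marker."""
        ids = []
        for entry in os.listdir(self.root):
            marker = os.path.join(self.root, entry, "VALID")
            if entry.startswith("line_") and os.path.exists(marker):
                ids.append(int(entry[len("line_"):]))
        return sorted(ids)

    def rollback(self):
        """Restore every registered item from the newest valid recovery line."""
        lines = self.valid_lines()
        if not lines:
            raise RuntimeError("no valid recovery line")
        path = self._line_dir(lines[-1])
        for name, (_, restore) in self.items.items():
            with open(os.path.join(path, name + ".json")) as f:
                restore(json.load(f))
        return lines[-1]


# Usage: the application decides what to save and when.
state = {"step": 0, "values": [0.0, 0.0]}
store = RecoveryLineStore(tempfile.mkdtemp())
store.register("state", lambda: state, state.update)

state["step"] = 10
store.checkpoint()   # user-triggered checkpoint
state["step"] = 17   # progress that a failure would lose
store.rollback()     # state["step"] is restored from the recovery line
```

Writing the marker file last is one simple way to separate valid recovery lines from incomplete ones; a parallel version would need all processes to agree before a line is marked valid.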
Similar resources
New User-Guided and ckpt-Based Checkpointing Libraries for Parallel MPI Applications
We present design and implementation details as well as performance results for two new parallel checkpointing libraries that we developed for parallel MPI applications. The first one, a user-guided library, requires the programmer to provide packing and unpacking code through an easy-to-use API using MPI constants. It uses MPI-2 collective I/O calls or a dedicated master process for checkpointi...
Dynamic Malleability in MPI Applications
Malleability enables a parallel application’s execution system to split or merge processes modifying the parallel application’s granularity. While process migration is widely used to adapt applications to dynamic execution environments, it is limited by the granularity of the application’s processes. Malleability empowers process migration by allowing the application’s processes to expand or sh...
Malleable iterative MPI applications
Malleability enables a parallel application’s execution system to split or merge processes modifying granularity. While process migration is widely used to adapt applications to dynamic execution environments, it is limited by the granularity of the application’s processes. Malleability empowers process migration by allowing the application’s processes to expand or shrink following the availabi...
DMTCP: Scalable User-Level Transparent Checkpointing for Cluster Computations
As the size of clusters increases, failures are becoming increasingly frequent. Applications must become fault tolerant if they are to run for extended periods of time. We present DMTCP (Distributed MultiThreaded CheckPointing), the first user-level distributed checkpointing package not dependent on a specific message passing library. This contrasts with existing approaches either specific to l...
Transparent User-Level Checkpointing for the Native Posix Thread Library for Linux
Checkpointing of single-threaded applications has been long studied [3], [6], [8], [12], [15]. Much less research has been done for user-level checkpointing of multithreaded applications. Dieter and Lumpp studied the issue for LinuxThreads in Linux 2.2. However, that solution does not work on later versions of Linux. We present an updated solution for Linux 2.6, which uses the more recent NPTL ...
Publication date: 1995